19TH JUNE,2023
1.1 Description:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
Column attribute Description Patient number Identifies patients by number Cholesterol Total cholesterol Glucose Fasting blood sugar HDL HDL or good cholesterol Chol/HDL Ratio of total cholesterol to good cholesterol. Desirable result is < 5 Age All adult African Americans Gender 162 males, 228 females Height In inches Weight In pounds (lbs) BMI 703 x weight (lbs)/ [height(inches]2 Systolic BP The upper number of blood pressure Diastolic BP The lower number of blood pressure Waist Measured in inches Hip Measured in inches Waist/hip Ratio is possibly a stronger risk factor for heart disease than BMI Diabetes Yes (60), No (330)
This dataset is a modified data from VanderBilt but the original dataset is from a study of rural African American in Virginia. The dataset already has 13 patients dropped due to missing data. It also has two features added, being Chol/HDL and Waist/hip.
df = pd.read_csv('\\Users\\Daniel Appiah\\Desktop\\Diabetes_Classification.csv', index_col='Patient number')
df.head()
| Cholesterol | Glucose | HDL Chol | Chol/HDL ratio | Age | Gender | Height | Weight | BMI | Systolic BP | Diastolic BP | waist | hip | Waist/hip ratio | Diabetes | Unnamed: 16 | Unnamed: 17 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patient number | |||||||||||||||||
| 1 | 193 | 77 | 49 | 3.9 | 19 | female | 61 | 119 | 22.5 | 118 | 70 | 32 | 38 | 0.84 | No diabetes | 6.0 | 6.0 |
| 2 | 146 | 79 | 41 | 3.6 | 19 | female | 60 | 135 | 26.4 | 108 | 58 | 33 | 40 | 0.83 | No diabetes | NaN | NaN |
| 3 | 217 | 75 | 54 | 4.0 | 20 | female | 67 | 187 | 29.3 | 110 | 72 | 40 | 45 | 0.89 | No diabetes | NaN | NaN |
| 4 | 226 | 97 | 70 | 3.2 | 20 | female | 64 | 114 | 19.6 | 122 | 64 | 31 | 39 | 0.79 | No diabetes | NaN | NaN |
| 5 | 164 | 91 | 67 | 2.4 | 20 | female | 70 | 141 | 20.2 | 122 | 86 | 32 | 39 | 0.82 | No diabetes | NaN | NaN |
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 390 entries, 1 to 390 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Cholesterol 390 non-null int64 1 Glucose 390 non-null int64 2 HDL Chol 390 non-null int64 3 Chol/HDL ratio 390 non-null float64 4 Age 390 non-null int64 5 Gender 390 non-null object 6 Height 390 non-null int64 7 Weight 390 non-null int64 8 BMI 390 non-null float64 9 Systolic BP 390 non-null int64 10 Diastolic BP 390 non-null int64 11 waist 390 non-null int64 12 hip 390 non-null int64 13 Waist/hip ratio 390 non-null float64 14 Diabetes 390 non-null object 15 Unnamed: 16 1 non-null float64 16 Unnamed: 17 1 non-null float64 dtypes: float64(5), int64(10), object(2) memory usage: 54.8+ KB
All features here are numeric except the column named Diabetes.
df.describe()
| Cholesterol | Glucose | HDL Chol | Chol/HDL ratio | Age | Height | Weight | BMI | Systolic BP | Diastolic BP | waist | hip | Waist/hip ratio | Unnamed: 16 | Unnamed: 17 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 390.000000 | 1.0 | 1.0 |
| mean | 207.230769 | 107.338462 | 50.266667 | 4.524615 | 46.774359 | 65.951282 | 177.407692 | 28.775641 | 137.133333 | 83.289744 | 37.869231 | 42.992308 | 0.881385 | 6.0 | 6.0 |
| std | 44.666005 | 53.798188 | 17.279069 | 1.736634 | 16.435911 | 3.918867 | 40.407824 | 6.600915 | 22.859528 | 13.498192 | 5.760947 | 5.664342 | 0.073212 | NaN | NaN |
| min | 78.000000 | 48.000000 | 12.000000 | 1.500000 | 19.000000 | 52.000000 | 99.000000 | 15.200000 | 90.000000 | 48.000000 | 26.000000 | 30.000000 | 0.680000 | 6.0 | 6.0 |
| 25% | 179.000000 | 81.000000 | 38.000000 | 3.200000 | 34.000000 | 63.000000 | 150.250000 | 24.100000 | 122.000000 | 75.000000 | 33.000000 | 39.000000 | 0.830000 | 6.0 | 6.0 |
| 50% | 203.000000 | 90.000000 | 46.000000 | 4.200000 | 44.500000 | 66.000000 | 173.000000 | 27.800000 | 136.000000 | 82.000000 | 37.000000 | 42.000000 | 0.880000 | 6.0 | 6.0 |
| 75% | 229.000000 | 107.750000 | 59.000000 | 5.400000 | 60.000000 | 69.000000 | 200.000000 | 32.275000 | 148.000000 | 90.000000 | 41.000000 | 46.000000 | 0.930000 | 6.0 | 6.0 |
| max | 443.000000 | 385.000000 | 120.000000 | 19.300000 | 92.000000 | 76.000000 | 325.000000 | 55.800000 | 250.000000 | 124.000000 | 56.000000 | 64.000000 | 1.140000 | 6.0 | 6.0 |
df.corr()
| Cholesterol | Glucose | HDL Chol | Chol/HDL ratio | Age | Height | Weight | BMI | Systolic BP | Diastolic BP | waist | hip | Waist/hip ratio | Unnamed: 16 | Unnamed: 17 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cholesterol | 1.000000 | 0.158102 | 0.193162 | 0.475927 | 0.247333 | -0.063601 | 0.062359 | 0.091695 | 0.207741 | 0.166241 | 0.134038 | 0.093364 | 0.091847 | NaN | NaN |
| Glucose | 0.158102 | 1.000000 | -0.158302 | 0.282210 | 0.294392 | 0.098052 | 0.190358 | 0.129286 | 0.162777 | 0.020262 | 0.222336 | 0.138223 | 0.185117 | NaN | NaN |
| HDL Chol | 0.193162 | -0.158302 | 1.000000 | -0.681867 | 0.028210 | -0.087238 | -0.291883 | -0.241860 | 0.031807 | 0.078342 | -0.276697 | -0.223837 | -0.158777 | NaN | NaN |
| Chol/HDL ratio | 0.475927 | 0.282210 | -0.681867 | 1.000000 | 0.163201 | 0.081162 | 0.278812 | 0.228407 | 0.115505 | 0.038242 | 0.313262 | 0.208902 | 0.243329 | NaN | NaN |
| Age | 0.247333 | 0.294392 | 0.028210 | 0.163201 | 1.000000 | -0.082229 | -0.056784 | -0.009164 | 0.453417 | 0.068649 | 0.150585 | 0.004675 | 0.275188 | NaN | NaN |
| Height | -0.063601 | 0.098052 | -0.087238 | 0.081162 | -0.082229 | 1.000000 | 0.255389 | -0.259589 | -0.040704 | 0.043617 | 0.057447 | -0.095906 | 0.252548 | NaN | NaN |
| Weight | 0.062359 | 0.190358 | -0.291883 | 0.278812 | -0.056784 | 0.255389 | 1.000000 | 0.860147 | 0.097497 | 0.166477 | 0.847766 | 0.826985 | 0.250461 | NaN | NaN |
| BMI | 0.091695 | 0.129286 | -0.241860 | 0.228407 | -0.009164 | -0.259589 | 0.860147 | 1.000000 | 0.121408 | 0.145304 | 0.810701 | 0.881728 | 0.100873 | NaN | NaN |
| Systolic BP | 0.207741 | 0.162777 | 0.031807 | 0.115505 | 0.453417 | -0.040704 | 0.097497 | 0.121408 | 1.000000 | 0.603662 | 0.210934 | 0.155321 | 0.137871 | NaN | NaN |
| Diastolic BP | 0.166241 | 0.020262 | 0.078342 | 0.038242 | 0.068649 | 0.043617 | 0.166477 | 0.145304 | 0.603662 | 1.000000 | 0.165846 | 0.143898 | 0.077918 | NaN | NaN |
| waist | 0.134038 | 0.222336 | -0.276697 | 0.313262 | 0.150585 | 0.057447 | 0.847766 | 0.810701 | 0.210934 | 0.165846 | 1.000000 | 0.835177 | 0.514177 | NaN | NaN |
| hip | 0.093364 | 0.138223 | -0.223837 | 0.208902 | 0.004675 | -0.095906 | 0.826985 | 0.881728 | 0.155321 | 0.143898 | 0.835177 | 1.000000 | -0.037664 | NaN | NaN |
| Waist/hip ratio | 0.091847 | 0.185117 | -0.158777 | 0.243329 | 0.275188 | 0.252548 | 0.250461 | 0.100873 | 0.137871 | 0.077918 | 0.514177 | -0.037664 | 1.000000 | NaN | NaN |
| Unnamed: 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Unnamed: 17 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df.isnull().sum()
Cholesterol 0 Glucose 0 HDL Chol 0 Chol/HDL ratio 0 Age 0 Gender 0 Height 0 Weight 0 BMI 0 Systolic BP 0 Diastolic BP 0 waist 0 hip 0 Waist/hip ratio 0 Diabetes 0 Unnamed: 16 389 Unnamed: 17 389 dtype: int64
The data has no missing values except for the last 2 unnamed features, which will be dropped due to the large number of null values.
df= df.drop(['Unnamed: 16','Unnamed: 17'], axis=1)
df.columns
Index(['Cholesterol', 'Glucose', 'HDL Chol', 'Chol/HDL ratio', 'Age', 'Gender',
'Height', 'Weight', 'BMI', 'Systolic BP', 'Diastolic BP', 'waist',
'hip', 'Waist/hip ratio', 'Diabetes'],
dtype='object')
unique_values= df.apply(lambda x:x.nunique())
unique_values
Cholesterol 153 Glucose 116 HDL Chol 75 Chol/HDL ratio 69 Age 68 Gender 2 Height 22 Weight 139 BMI 193 Systolic BP 71 Diastolic BP 56 waist 30 hip 32 Waist/hip ratio 39 Diabetes 2 dtype: int64
From this and a critical look at the data, the 'Gender' and 'Diabetes' columns are nominal features and must be changed to categorical data type.
df['Gender'] =df['Gender'].astype('category')
df['Diabetes'] =df['Diabetes'].astype('category')
df['Diabetes'].dtypes
CategoricalDtype(categories=['Diabetes', 'No diabetes'], ordered=False)
plt.figure(figsize=(5,4))
sns.distplot(df.Cholesterol)
plt.title('Cholesterol Level Distribution(mg/dL)', fontdict={'fontweight': 'bold'})
#Classifying the cholesterol level.
desirable = df[df.Cholesterol < 200].count()[0]
borderline = df[(df.Cholesterol >= 200) & (df.Cholesterol < 240)].count()[0]
high = df[df.Cholesterol >= 240].count()[0]
print(f'Desirable Range (below 200 mg/dL): {desirable} patients.')
print(f'Borderline High Range (200-239 mg/dL): {borderline} patients.')
print(f'High Range (240 mg/dL and above): {high} patients.')
Desirable Range (below 200 mg/dL): 184 patients. Borderline High Range (200-239 mg/dL): 130 patients. High Range (240 mg/dL and above): 76 patients.
plt.figure(figsize=(5,4))
sns.distplot(df.Glucose)
plt.title('Glucose Level Distribution(mg/dL) ',fontdict={'fontweight': 'bold'})
desirable = df[df.Glucose < 100].count()[0]
borderline = df[(df.Glucose >= 100) & (df.Glucose < 125)].count()[0]
diabetes = df[df.Glucose >= 126].count()[0]
print(f'Normal Range (below 100 mg/dL): {desirable} patients.')
print(f'Prediabetes Range (100-125 mg/dL): {borderline} patients.')
print(f'Diabetes Range (126 mg/dL and above): {diabetes} patients.')
Normal Range (below 100 mg/dL): 260 patients. Prediabetes Range (100-125 mg/dL): 69 patients. Diabetes Range (126 mg/dL and above): 60 patients.
plt.figure(figsize=(5,4))
sns.distplot(df['HDL Chol'])
plt.title('High Density Lipoprotein Cholesterol Level Distribution(mg/dL)',fontdict={'fontweight': 'bold'})
desirable_men = df[(df['HDL Chol'] >= 40) & (df['HDL Chol'] < 60) ].count()[0]
desirable_women = df[(df['HDL Chol'] >= 50) & (df['HDL Chol'] < 60)].count()[0]
low_men = df[df['HDL Chol'] <40 ].count()[0]
low_women= df[df['HDL Chol'] <50 ].count()[0]
high = df[df['HDL Chol'] >= 60].count()[0]
print(f'Low Range for men (below 40 mg/dL): {low_men} patients.')
print(f'Low Range for women (below 50 mg/dL): {low_women} patients.')
print(f'Normal HDL Range for men (40-59 mg/dL): {desirable_men} patients.')
print(f'Normal HDL Range for women(50-59 mg/dL): {desirable_women} patients.')
print(f'High HDL Range(60 mg/dL and above): {high} patients.')
Low Range for men (below 40 mg/dL): 108 patients. Low Range for women (below 50 mg/dL): 226 patients. Normal HDL Range for men (40-59 mg/dL): 189 patients. Normal HDL Range for women(50-59 mg/dL): 71 patients. High HDL Range(60 mg/dL and above): 93 patients.
plt.figure(figsize=(5,4))
sns.distplot(df.Age)
plt.title('Age Distribution(yrs)',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Age Distribution(yrs)')
plt.figure(figsize=(3,7))
sns.histplot(df.Gender)
plt.title('Gender Distribution',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Gender Distribution')
plt.figure(figsize=(5,4))
sns.distplot(df.Height)
plt.title('Height Distribution(inches)',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Height Distribution(inches)')
plt.figure(figsize=(5,4))
sns.distplot(df.Weight)
plt.title('Weight Distribution(pounds(lbs))',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Weight Distribution(pounds(lbs))')
plt.figure(figsize=(5,4))
sns.distplot(df.BMI)
plt.title('Body Mass Index Distribution(lbs/(in)^2 )',fontdict={'fontweight': 'bold'})
underweight = df[df['BMI'] <18.5].count()[0]
normal_weight = df[(df['BMI'] >= 18.5) & (df['BMI'] < 24.9)].count()[0]
overweight = df[(df['BMI'] >=25.5) & (df['BMI'] <30 )].count()[0]
obese= df[df['BMI'] >=30 ].count()[0]
print(f'Underweight (below 18.5 lbs/in^2): {underweight} patients.')
print(f'Normal weight (18.5-24.9 lbs/in^2): {normal_weight} patients.')
print(f'Overweight (25-29.9 lbs/in^2): {overweight} patients.')
print(f'Obese (30 lbs/in^2 and above): {obese} patients.')
Underweight (below 18.5 lbs/in^2): 9 patients. Normal weight (18.5-24.9 lbs/in^2): 106 patients. Overweight (25-29.9 lbs/in^2): 107 patients. Obese (30 lbs/in^2 and above): 150 patients.
The BMI classes are as follows: underweight (BMI less than 18.5), normal weight (BMI between 18.5 and 24.9), overweight (BMI between 25 and 29.9), obesity (Class I, BMI between 30 and 34.9), obesity (Class II, BMI between 35 and 39.9), and obesity (Class III, BMI 40 or higher)
plt.figure(figsize=(5,4))
sns.distplot(df['Systolic BP'])
plt.title('Systolic Blood Pressure Distribution(mmHg)',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Systolic Blood Pressure Distribution(mmHg)')
The systolic blood pressure (SBP) categories are as follows: normal (SBP below 120 mmHg but not up to 90mmHg), elevated (SBP between 120 and 129 mmHg), hypertension stage 1 (SBP between 130 and 139 mmHg), hypertension stage 2 (SBP 140 mmHg or higher), and hypertensive crisis (SBP higher than 180 mmHg and/or diastolic blood pressure (DBP) higher than 120 mmHg). This data has few portion below 120mmHg and the rest above.
plt.figure(figsize=(5,4))
sns.distplot(df['Diastolic BP'])
plt.title('Diastolic Blood Pressure Distribution(mm/Hg)',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Diastolic Blood Pressure Distribution(mm/Hg)')
The diastolic blood pressure (DBP) categories are as follows: normal (DBP below 80 mmHg b ut not up to 60mmHg), elevated (DBP between 80 and 89 mmHg), hypertension stage 1 (DBP between 90 and 99 mmHg), hypertension stage 2 (DBP 100 mmHg or higher), and hypertensive crisis (DBP higher than 120 mmHg and/or systolic blood pressure (SBP) higher than 180 mmHg).
plt.figure(figsize=(5,4))
sns.distplot(df['waist'])
plt.title('Waist Size Distribution(inches)',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Waist Size Distribution(inches)')
plt.figure(figsize=(5,4))
sns.distplot(df['hip'])
plt.title('Hip Size Distribution(inches)',fontdict={'fontweight': 'bold'})
Text(0.5, 1.0, 'Hip Size Distribution(inches)')
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Diabetes'] = label_encoder.fit_transform(df['Diabetes'])
df['Gender'] = label_encoder.fit_transform(df['Gender'])
label_mapping = {0: 1, 1: 0}
df['Diabetes']=df['Diabetes'].map(label_mapping)
df.head()
| Cholesterol | Glucose | HDL Chol | Chol/HDL ratio | Age | Gender | Height | Weight | BMI | Systolic BP | Diastolic BP | waist | hip | Waist/hip ratio | Diabetes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patient number | |||||||||||||||
| 1 | 193 | 77 | 49 | 3.9 | 19 | 0 | 61 | 119 | 22.5 | 118 | 70 | 32 | 38 | 0.84 | 0 |
| 2 | 146 | 79 | 41 | 3.6 | 19 | 0 | 60 | 135 | 26.4 | 108 | 58 | 33 | 40 | 0.83 | 0 |
| 3 | 217 | 75 | 54 | 4.0 | 20 | 0 | 67 | 187 | 29.3 | 110 | 72 | 40 | 45 | 0.89 | 0 |
| 4 | 226 | 97 | 70 | 3.2 | 20 | 0 | 64 | 114 | 19.6 | 122 | 64 | 31 | 39 | 0.79 | 0 |
| 5 | 164 | 91 | 67 | 2.4 | 20 | 0 | 70 | 141 | 20.2 | 122 | 86 | 32 | 39 | 0.82 | 0 |
data_correlation= df.corr()
plt.figure(figsize= (7,5))
plt.title('Correlation Heatmap', fontdict= {'fontweight':'bold'})
sns.heatmap(data_correlation,annot=True,fmt='.0%' , annot_kws= {'fontsize': 7})
<AxesSubplot: title={'center': 'Correlation Heatmap'}>
#cat= df['Diabetes_encoded','Gender_encoded']
#data= df.drop(cat)
plt.figure(figsize= (15,15))
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x282415ddd0>
<Figure size 1500x1500 with 0 Axes>
Let us take a look at features with very good correlation.
correlation_matrix = df.corr()
high_correlation_pairs = []
for i in range(len(correlation_matrix.columns)):
for j in range(i + 1, len(correlation_matrix.columns)):
if abs(correlation_matrix.iloc[i, j]) > 0.30:
pair = (correlation_matrix.columns[i], correlation_matrix.columns[j])
high_correlation_pairs.append(pair)
print(high_correlation_pairs)
[('Cholesterol', 'Chol/HDL ratio'), ('Glucose', 'Diabetes'), ('HDL Chol', 'Chol/HDL ratio'), ('Chol/HDL ratio', 'waist'), ('Age', 'Systolic BP'), ('Age', 'Diabetes'), ('Gender', 'Height'), ('Gender', 'Waist/hip ratio'), ('Weight', 'BMI'), ('Weight', 'waist'), ('Weight', 'hip'), ('BMI', 'waist'), ('BMI', 'hip'), ('Systolic BP', 'Diastolic BP'), ('waist', 'hip'), ('waist', 'Waist/hip ratio')]
The pairs above show features that have correlation coeffient above plus or minus 0.3 .
correlation_matrix = df.corr()
target_correlation = correlation_matrix['Diabetes']
target_correlation = target_correlation.abs().sort_values(ascending=True)
target_correlation = target_correlation.drop('Diabetes')
plt.figure(figsize=(5, 4))
plt.barh(target_correlation.index, target_correlation.values)
plt.xlabel('Correlation')
plt.ylabel('Features')
plt.title('Correlation between Features and Diabetes')
plt.show()
From the graph we can see that glucose level has the highest correlation coeffeicient with Diabetes, then surprisingly, age through to height having the lowest correlation coefficient with Diabetes.
Since the age seems to have a good correlation, let us utilise it in exploring new features with higher correlation.
#creating square of the ages feature
age_square= (df.Age)**2
New_features= pd.DataFrame()
New_features['Age_square']= age_square
Waist seems to have a good correlation with Diabetes, I am creating a new feature ((waist)^2/Age) to see how it will correlate with diabetes since a relationship between a person's waist and age can reveals obesity.
Square_Waist_per_Age= df['waist']**2/df['Age']
New_features['Square_Waist/Age']= Square_Waist_per_Age
Creating another feature which is square root of (Glucose level * Weight * (Chol/HDL))
GWC= np.sqrt(df.Glucose * df.Weight * df['Chol/HDL ratio'])
New_features['GWC']= GWC
New_features['Diabetes']= df.Diabetes
New_features['Glucose']=df.Glucose
New_features['BMI']=df.BMI
New_features['Chol/HDL ratio'] =df['Chol/HDL ratio']
New_features['Age']= df.Age
corr_matrix = New_features.corr()
target_correlation = corr_matrix['Diabetes']
target_correlation = target_correlation.abs().sort_values(ascending=True)
target_correlation = target_correlation.drop('Diabetes')
plt.figure(figsize=(5, 4))
color=['orange','grey','grey','orange','grey','orange','grey']
plt.barh(target_correlation.index, target_correlation.values, color=color)
plt.xlabel('Correlation')
plt.ylabel('Features')
plt.title('Correlation between New Features and Diabetes')
plt.show()
print(f'Correlation Coefficient of GWC is {target_correlation.GWC} whilst that of glucose is {target_correlation.Glucose}')
Correlation Coefficient of GWC is 0.5975911038445685 whilst that of glucose is 0.6890795038664445
From the graph, it could be revealed that the new feature GWC which is the square root of (Weight * Glucose Level * (cholesterol level/HDL level) tends to be of a very high correlation coefficient with Diabetes and can be a very useful factor in diagnosing diabetes even over most of the the traditional factors that have always been used. The feature GWC makes use of glucose level, cholesterol level, weight and HDL level to give a more accurate foresight concerning the likelihood of the patient being diabetic or not. From analysis, GWC above 200 calls for attention since it is more likely patient is diabetic.
data= New_features.merge(df)
X = data.drop(columns ='Diabetes')
y = data.Diabetes
X_train,X_test,y_train,y_valid= train_test_split(X,y,test_size=0.2,random_state=0)
my_model= RandomForestClassifier(n_estimators=500 , max_depth= 100, random_state=0)
my_model.fit(X_train,y_train)
predictions = my_model.predict(X_test)
efficiency = my_model.score(X_test, y_valid)
mae= mean_absolute_error(y_valid,predictions)
print('model accuracy:', efficiency)
print('mean absolute error:', mae)
model accuracy: 0.9230769230769231 mean absolute error: 0.07692307692307693
feature= np.array([[2400,29,350,83,24,(200/39),45,200,39,0,52,170,122,75,33,42,(33/42)]])
my_model.predict(feature)
array([0], dtype=int64)
import pandas as pd
import numpy as np
def get_numeric_input(prompt):
while True:
try:
value = int(input(prompt))
return value
except ValueError:
print("Invalid input. Please enter a numeric value.")
def predict_diabetes():
age = get_numeric_input('Age of patient: ')
age_sqr = age ** 2
waist = get_numeric_input('Waist size in Inches: ')
weight = get_numeric_input('Weight of patient in lbs: ')
sqr_waist_per_age = (waist ** 2) / age
glucose = get_numeric_input('Fasting glucose level in mg/dL: ')
cholesterol = get_numeric_input('Cholesterol level in mg/dL: ')
HDL = get_numeric_input('HDL Cholesterol level in mg/dL: ')
GWC = np.sqrt(glucose * weight * (cholesterol / HDL))
while True:
try:
gender = int(input("Gender (Input '1' for male or '0' for female): "))
gender = str(gender)
if gender not in ('0', '1'):
raise ValueError
break
except ValueError:
print("Invalid input. Please enter either '0' or '1'.")
height = get_numeric_input('Height of patient in inches: ')
BMI = 703 * (weight / (height ** 2))
systolic_bp = get_numeric_input('Systolic Blood Pressure: ')
diastolic_bp = get_numeric_input('Diastolic Blood Pressure: ')
hip = get_numeric_input('Hip size in inches: ')
waist_per_hip = waist / hip
cholesterol_per_HDL = cholesterol / HDL
# Forming a dataframe for the model
data = {
'Age_square': [age_sqr],
'Square_Waist/Age': [sqr_waist_per_age],
'GWC': [GWC],
'Glucose': [glucose],
'BMI': [BMI],
'Chol/HDL ratio': [cholesterol_per_HDL],
'Age': [age],
'Cholesterol': [cholesterol],
'HDL Chol': [HDL],
'Gender': [gender],
'Height': [height],
'Weight': [weight],
'Systolic BP': [systolic_bp],
'Diastolic BP': [diastolic_bp],
'waist': [waist],
'hip': [hip],
'Waist/hip ratio': [waist_per_hip]
}
df = pd.DataFrame(data)
prediction = my_model.predict(df)
if prediction == 1:
return "This patient is likely to be diabetic."
else:
return "This patient is likely to be undiabetic."
result = predict_diabetes()
print(result)
Age of patient: 43 Waist size in Inches: 41 Weight of patient in lbs: 200 Fasting glucose level in mg/dL: 127 Cholesterol level in mg/dL: 293 HDL Cholesterol level in mg/dL: 60 Gender (Input '1' for male or '0' for female): 1 Height of patient in inches: 65 Systolic Blood Pressure: 141 Diastolic Blood Pressure: 97 Hip size in inches: 39 This patient is likely to be undiabetic.
result = predict_diabetes()
print(result)